Abstract

We propose a new method for unconstrained optimization of a smooth and strongly convex 
function, which attains the optimal rate of convergence of Nesterov’s accelerated gradient 
descent. The new algorithm has a simple geometric interpretation, loosely inspired by the 
ellipsoid method. We provide some numerical evidence that the new method can be superior 
to Nesterov’s accelerated gradient descent. 


1 Introduction 

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $\beta$-smooth and $\alpha$-strongly convex function. Thus, for any $x, y \in \mathbb{R}^n$, we have

$$f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}|y - x|^2 \le f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{\beta}{2}|y - x|^2.$$

Let $\kappa = \beta/\alpha$ be its condition number. It is a one line calculation to verify that a step of gradient descent on $f$ will decrease (multiplicatively) the squared distance to the optimum by $1 - 1/\kappa$. In this paper we propose a new method, which can be viewed as some combination of gradient descent and the ellipsoid method, for which the squared distance to the optimum decreases at a rate of $(1 - 1/\sqrt{\kappa})$ (and each iteration requires one gradient evaluation and two line searches). This matches the optimal rate of convergence among the class of first-order methods [Nesterov(1983), Nesterov(2004)].
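The contraction by $1 - 1/\kappa$ can be checked numerically. The following is a minimal sketch, assuming a diagonal quadratic test function and a step size of $1/\beta$; the test function and all helper names are illustrative choices, not taken from the paper.

```python
# Illustrative check (assumed setup, not from the paper): gradient descent
# with step 1/beta on f(x) = 0.5 * sum(d_i * x_i^2), whose minimizer is x* = 0.

def grad(d, x):
    """Gradient of the diagonal quadratic."""
    return [di * xi for di, xi in zip(d, x)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(vi * vi for vi in v)


def gradient_step(d, x, beta):
    """One gradient step x - grad f(x) / beta."""
    return [xi - gi / beta for xi, gi in zip(x, grad(d, x))]


d = [1.0, 4.0, 25.0, 100.0]          # eigenvalues: alpha = 1, beta = 100
alpha, beta = min(d), max(d)
kappa = beta / alpha

x = [1.0, -2.0, 0.5, 3.0]
worst_ratio = 0.0
for _ in range(20):
    x_next = gradient_step(d, x, beta)
    # Since x* = 0, |x|^2 is the squared distance to the optimum.
    worst_ratio = max(worst_ratio, sq_norm(x_next) / sq_norm(x))
    x = x_next

print(worst_ratio <= 1.0 - 1.0 / kappa)
```

On this particular quadratic each coordinate contracts by $(1 - d_i/\beta)^2$, so the observed ratio is at most $(1 - 1/\kappa)^2$, which is within the general bound $1 - 1/\kappa$.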

1.1 Related works 

Nesterov's acceleration (i.e., replacing $\kappa$ by $\sqrt{\kappa}$ in the convergence rate) has proven to be of fundamental importance both in theory and in practice, see e.g. [Bubeck(2014)] for references. However the intuition behind Nesterov's accelerated gradient descent is notoriously difficult to grasp, and this has led to a recent surge of interest in new interpretations of this algorithm, as well as the reasons behind the possibility of acceleration for smooth problems, see [Allen-Zhu and Orecchia(2014), Lessard et al.(2014), Su et al.(2014), Flammarion and Bach(20
In this paper we propose a new method with a clear intuition and which achieves acceleration. 

Since the function is strongly convex, the gradient at any point gives a ball, say A, containing the
optimal solution. Using the fact that the function is smooth, one can get an improved bound
on the radius of this ball. The algorithm also maintains a ball B containing the optimal solution
obtained via the information from previous iterations. A simple calculation then shows that the
smallest ball enclosing the intersection of A and B already has a radius shrinking at the rate of
$1 - 1/\kappa$. To achieve the accelerated rate, we make the observation that the gradient information in
this iteration can also be used to shrink the ball B and therefore, the radius of the enclosing ball 
containing the intersection of A and B shrinks at a faster rate. We detail this intuition in Section 
2. The new optimal method is described and analyzed in Section 3. We conclude with some 
experiments in Section 4. 

1.2 Preliminaries 

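We write $|\cdot|$ for the Euclidean norm in $\mathbb{R}^n$, and $\mathrm{B}(x, r^2) := \{y \in \mathbb{R}^n : |y - x|^2 \le r^2\}$ (note that the second argument is the radius squared). We define the map $\mathrm{line\_search} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ by

$$\mathrm{line\_search}(x, y) = \operatorname*{argmin}_{t \in \mathbb{R}} f(x + t(y - x)),$$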


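and we denote

$$x^+ = x - \frac{1}{\beta}\nabla f(x), \qquad x^{++} = x - \frac{1}{\alpha}\nabla f(x).$$

Recall that by strong convexity one has

$$\forall y \in \mathbb{R}^n, \quad f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{\alpha}{2}|y - x|^2,$$

which implies in particular (take $y = x^*$, the minimizer of $f$, and complete the square):

$$x^* \in \mathrm{B}\left(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x) - f(x^*)\right)\right).$$

Furthermore recall that by smoothness one has $f(x^+) \le f(x) - \frac{1}{2\beta}|\nabla f(x)|^2$, which allows one to shrink the above ball by a factor of $1 - 1/\kappa$ and obtain the following:

$$x^* \in \mathrm{B}\left(x^{++}, \frac{|\nabla f(x)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x^+) - f(x^*)\right)\right). \tag{1}$$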
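As a sanity check of the enclosing ball (1), the sketch below evaluates its center and squared radius on a diagonal quadratic, for which $x^* = 0$ and $f(x^*) = 0$ are known. The test function and the helper name `enclosing_ball` are assumptions made for illustration only.

```python
# Illustrative sketch (assumed setup): f(x) = 0.5 * sum(d_i * x_i^2), so x* = 0
# and f(x*) = 0 are known and the ball from (1) can be checked directly.

def f(d, x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))


def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]


def enclosing_ball(d, x, alpha, beta, gap=0.0):
    """Center x^{++} and squared radius of the ball in (1).

    `gap` stands for f(x^+) - f(x*) >= 0; leaving it at 0 only enlarges the ball.
    """
    g = grad(d, x)
    g_sq = sum(gi * gi for gi in g)
    center = [xi - gi / alpha for xi, gi in zip(x, g)]        # x^{++}
    radius_sq = g_sq / alpha ** 2 * (1.0 - alpha / beta) - 2.0 / alpha * gap
    return center, radius_sq


d = [1.0, 4.0, 25.0, 100.0]
alpha, beta = min(d), max(d)
x = [1.0, -2.0, 0.5, 3.0]

x_plus = [xi - gi / beta for xi, gi in zip(x, grad(d, x))]    # x^+
center, radius_sq = enclosing_ball(d, x, alpha, beta, gap=f(d, x_plus))

dist_sq = sum(ci * ci for ci in center)                        # |x* - x^{++}|^2
print(dist_sq <= radius_sq)
```

In practice $f(x^*)$ is unknown, which is why the later sections carry the term $\frac{2}{\alpha}(f(x^+) - f(x^*))$ symbolically instead of evaluating it.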


2 Intuition 


In Section 2.1 we describe a geometric alternative to gradient descent (with the same convergence 
rate) which gives the core of our new optimal method. Then in Section 2.2 we explain why one 
can expect to accelerate this geometric algorithm. 
Figures 1 and 2: The left diagram (Figure 1) shows that the intersection shrinks at the same rate if only one of the balls shrinks; the right diagram (Figure 2) shows that the intersection shrinks much faster if both balls shrink by the same absolute amount.


2.1 A suboptimal algorithm 

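Assume that we are given a guarantee $R_0 > 0$ on the distance from some point $x_0$ to the optimum, that is $x^* \in \mathrm{B}(x_0, R_0^2)$. Combining this original enclosing ball for $x^*$ and the one obtained by (1) (with $f(x^*) \le f(x_0^+)$) one obtains

$$x^* \in \mathrm{B}(x_0, R_0^2) \cap \mathrm{B}\left(x_0 - \frac{1}{\alpha}\nabla f(x_0), \frac{|\nabla f(x_0)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

If $\frac{|\nabla f(x_0)|^2}{\alpha^2} \le R_0^2\left(1 - \frac{1}{\kappa}\right)$ then the second ball already shrinks by a factor of $\left(1 - \frac{1}{\kappa}\right)$. In the other case, when $\frac{|\nabla f(x_0)|^2}{\alpha^2} > R_0^2\left(1 - \frac{1}{\kappa}\right)$, the centers of the two balls are far apart and therefore there is a much smaller ball containing the intersection of the two balls. Formally, it is an easy calculation to see that for any $g \in \mathbb{R}^n$, $\varepsilon \in (0, 1)$, there exists $x \in \mathbb{R}^n$ such that

$$\mathrm{B}(0, 1) \cap \mathrm{B}(g, |g|^2(1 - \varepsilon)) \subset \mathrm{B}(x, 1 - \varepsilon). \quad \text{(Figure 1)}$$

In particular the two displays above imply that there exists $x_1 \in \mathbb{R}^n$ such that

$$x^* \in \mathrm{B}\left(x_1, R_0^2\left(1 - \frac{1}{\kappa}\right)\right).$$

Denote by $T$ the map from $x_0$ to $x_1$ defined implicitly above, and let $(x_k)$ be defined by $x_{k+1} = T(x_k)$. Then we just proved

$$|x^* - x_k|^2 \le \left(1 - \frac{1}{\kappa}\right)^k R_0^2.$$

In other words, after $2\kappa \log(R_0/\varepsilon)$ iterations (where each iteration costs one call to the gradient oracle) one obtains a point $\varepsilon$-close to the minimizer of $f$.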
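The iteration count follows from solving $(1 - 1/\kappa)^k R_0^2 \le \varepsilon^2$ for $k$. The sketch below does this arithmetic and, for comparison, repeats it with the rate $1 - 1/\sqrt{\kappa}$ of Theorem 1; the function name `iterations_needed` and the sample values of $\kappa$, $R_0$ and $\varepsilon$ are illustrative assumptions.

```python
import math


def iterations_needed(rate, r0, eps):
    """Smallest k with rate**k * r0**2 <= eps**2, for a contraction rate in (0, 1)."""
    return math.ceil(2.0 * math.log(r0 / eps) / -math.log(rate))


kappa, r0, eps = 100.0, 1.0, 1e-6    # sample values chosen for illustration

k_plain = iterations_needed(1.0 - 1.0 / kappa, r0, eps)             # rate of Section 2.1
k_accel = iterations_needed(1.0 - 1.0 / math.sqrt(kappa), r0, eps)  # rate of Theorem 1

# The closed-form bounds use the inequality -log(1 - u) >= u.
bound_plain = 2.0 * kappa * math.log(r0 / eps)
bound_accel = 2.0 * math.sqrt(kappa) * math.log(r0 / eps)

print(k_plain, bound_plain)
print(k_accel, bound_accel)
```

The gap between the two counts grows like $\sqrt{\kappa}$, which is the practical meaning of acceleration for ill-conditioned problems.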
2.2 Why one can accelerate

Assume now that we are given a guarantee $R_0 > 0$ such that $x^* \in \mathrm{B}\left(x_0, R_0^2 - \frac{2}{\alpha}(f(y) - f(x^*))\right)$ where $f(x_0) \le f(y)$ (say by choosing $y = x_0$). Using the fact that $f(x_0^+) \le f(x_0) - \frac{1}{2\beta}|\nabla f(x_0)|^2 \le f(y) - \frac{1}{2\alpha\kappa}|\nabla f(x_0)|^2$, we obtain that

$$x^* \in \mathrm{B}\left(x_0, R_0^2 - \frac{|\nabla f(x_0)|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right)$$

which, intuitively, allows us to shrink the radius squared from $R_0^2$ to $R_0^2 - \frac{|\nabla f(x_0)|^2}{\alpha^2\kappa}$ using the local information at $x_0$. From (1), we have

$$x^* \in \mathrm{B}\left(x_0^{++}, \frac{|\nabla f(x_0)|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right).$$

Now, intersecting the above two shrunk balls and using Lemma 1 (see below and also see Figure 2), we obtain that there is an $x_1'$ such that

$$x^* \in \mathrm{B}\left(x_1', R_0^2\left(1 - \frac{1}{\sqrt{\kappa}}\right) - \frac{2}{\alpha}\left(f(x_0^+) - f(x^*)\right)\right),$$

giving us an acceleration in the shrinking of the radius. To carry the argument to the next iteration, we would need $f(x_1') \le f(x_0^+)$, but this may not hold. Thus, we choose $x_1$ by a line search

$$x_1 = \mathrm{line\_search}(x_1', x_0^+)$$

which ensures that $f(x_1) \le f(x_0^+)$. To remedy the fact that the ball for the next iteration is centered at $x_1'$ and not $x_1$, we observe that the line search also ensures that $\nabla f(x_1)$ is perpendicular to the line going through $x_1$ and $x_1'$. This geometric fact is enough for the algorithm to work at the next iteration as well. In the next section we describe precisely our proposed algorithm which is based on the above insights.
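The two properties of the line search used above (no increase of the function value, and a gradient perpendicular to the search line) can be observed on a small example. This is a minimal sketch assuming a diagonal quadratic, for which the exact line search has a closed form; the points `x_prime` and `x_plus` are arbitrary stand-ins for $x_1'$ and $x_0^+$.

```python
# Illustrative sketch (assumed setup): exact line search for
# f(x) = 0.5 * sum(d_i * x_i^2), where the minimizer along a line is explicit.

def f(d, x):
    return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))


def grad(d, x):
    return [di * xi for di, xi in zip(d, x)]


def line_search(d, x, y):
    """Exact minimizer of f on the line through x and y."""
    v = [yi - xi for xi, yi in zip(x, y)]
    curvature = sum(di * vi * vi for di, vi in zip(d, v))
    if curvature == 0.0:            # x == y: the line degenerates to a point
        return list(x)
    t = -sum(di * xi * vi for di, xi, vi in zip(d, x, v)) / curvature
    return [xi + t * vi for xi, vi in zip(x, v)]


d = [1.0, 3.0, 10.0]
x_prime = [1.0, 2.0, -1.0]      # plays the role of x_1'
x_plus = [0.5, -1.0, 3.0]       # plays the role of x_0^+
x_new = line_search(d, x_prime, x_plus)

direction = [b - a for a, b in zip(x_prime, x_plus)]
inner = sum(gi * vi for gi, vi in zip(grad(d, x_new), direction))
print(abs(inner) < 1e-9, f(d, x_new) <= f(d, x_plus))
```

For a general smooth function the closed form is not available and a one-dimensional solver would take its place; the orthogonality property holds at any exact minimizer along the line.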


3 An optimal algorithm 

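Let $x_0 \in \mathbb{R}^n$, $c_0 = x_0^{++}$, and $R_0^2 = \left(1 - \frac{1}{\kappa}\right)\frac{|\nabla f(x_0)|^2}{\alpha^2}$. For any $k \ge 0$ let

$$x_{k+1} = \mathrm{line\_search}(c_k, x_k^+),$$

and $c_{k+1}$ (respectively $R_{k+1}^2$) be the center (respectively the squared radius) of the ball given by (the proof of) Lemma 1 which contains

$$\mathrm{B}\left(c_k, R_k^2 - \frac{|\nabla f(x_{k+1})|^2}{\alpha^2\kappa}\right) \cap \mathrm{B}\left(x_{k+1}^{++}, \frac{|\nabla f(x_{k+1})|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right)\right).$$

The formulas for $c_{k+1}$ and $R_{k+1}^2$ are given in Algorithm 1.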
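Theorem 1 For any $k \ge 0$, one has $x^* \in \mathrm{B}(c_k, R_k^2)$, $R_{k+1}^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right) R_k^2$, and thus

$$|x^* - c_k|^2 \le \left(1 - \frac{1}{\sqrt{\kappa}}\right)^k R_0^2.$$

Proof We will prove a stronger claim by induction that for each $k \ge 0$, one has

$$x^* \in \mathrm{B}\left(c_k, R_k^2 - \frac{2}{\alpha}\left(f(x_k^+) - f(x^*)\right)\right).$$

The case $k = 0$ follows immediately by (1). Let us assume that the above display is true for some $k \ge 0$. Then using $f(x^*) \le f(x_{k+1}^+) \le f(x_{k+1}) - \frac{1}{2\beta}|\nabla f(x_{k+1})|^2 \le f(x_k^+) - \frac{1}{2\beta}|\nabla f(x_{k+1})|^2$, one gets

$$x^* \in \mathrm{B}\left(c_k, R_k^2 - \frac{|\nabla f(x_{k+1})|^2}{\alpha^2\kappa} - \frac{2}{\alpha}\left(f(x_{k+1}^+) - f(x^*)\right)\right).$$

Furthermore by (1) one also has

$$x^* \in \mathrm{B}\left(x_{k+1}^{++}, \frac{|\nabla f(x_{k+1})|^2}{\alpha^2}\left(1 - \frac{1}{\kappa}\right) - \frac{2}{\alpha}\left(f(x_{k+1}^+) - f(x^*)\right)\right).$$

Thus it only remains to observe that the squared radius of the ball given by Lemma 1 which encloses the intersection of the two above balls is smaller than $\left(1 - \frac{1}{\sqrt{\kappa}}\right) R_k^2 - \frac{2}{\alpha}\left(f(x_{k+1}^+) - f(x^*)\right)$.

We apply Lemma 1 after moving $c_k$ to the origin and scaling distances by $R_k$. We set $\varepsilon = \frac{1}{\kappa}$, $g = \frac{|\nabla f(x_{k+1})|}{\alpha}$, $\delta = \frac{2}{\alpha}\left(f(x_{k+1}^+) - f(x^*)\right)$ and $a = x_{k+1}^{++} - c_k$. The line search step of the algorithm implies that $\nabla f(x_{k+1})^\top (x_{k+1} - c_k) = 0$ and therefore $|a| = |x_{k+1}^{++} - c_k| \ge |\nabla f(x_{k+1})|/\alpha = g$, and Lemma 1 applies to give the result. ■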


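Lemma 1 Let $a \in \mathbb{R}^n$ and $\varepsilon \in (0, 1)$, $g \in \mathbb{R}_+$. Assume that $|a| \ge g$. Then there exists $c \in \mathbb{R}^n$ such that for any $\delta > 0$,

$$\mathrm{B}(0, 1 - \varepsilon g^2 - \delta) \cap \mathrm{B}(a, g^2(1 - \varepsilon) - \delta) \subset \mathrm{B}(c, 1 - \sqrt{\varepsilon} - \delta).$$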

Proof First observe that if $g^2 \le 1/2$ then one can take $c = a$ since $\frac{1}{2}(1 - \varepsilon) \le 1 - \sqrt{\varepsilon}$. Thus we assume now that $g^2 \ge 1/2$, and note that we can also clearly assume that $n = 2$. Consider the segment joining the two points at the intersection of the two balls under consideration. We denote by $c$ the point at the intersection of this segment and $[0, a]$, and $x = |c|$ (that is $c = x\frac{a}{|a|}$). A simple picture reveals that $x$ satisfies

$$1 - \varepsilon g^2 - \delta - x^2 = g^2(1 - \varepsilon) - \delta - (|a| - x)^2 \iff x = \frac{1 + |a|^2 - g^2}{2|a|}.$$

When $x \le |a|$, neither of the balls covers more than half of the other ball and hence the intersection of the two balls is contained in the ball $\mathrm{B}\left(x\frac{a}{|a|}, 1 - \varepsilon g^2 - \delta - x^2\right)$ (see Figure 2). Thus it only remains to show that $x \le |a|$ and that $1 - \varepsilon g^2 - \delta - x^2 \le 1 - \sqrt{\varepsilon} - \delta$. The first inequality is equivalent to $|a|^2 + g^2 \ge 1$ which follows from $|a|^2 \ge g^2 \ge 1/2$. The second inequality to prove can be written as

$$\varepsilon g^2 + \frac{(1 + |a|^2 - g^2)^2}{4|a|^2} \ge \sqrt{\varepsilon},$$

which is straightforward to verify (recall that $|a|^2 \ge g^2 \ge 1/2$).

Algorithm 1: Minimum Enclosing Ball of the Intersection of Two Balls
Input: a ball centered at $x_A$ with radius $R_A$ and a ball centered at $x_B$ with radius $R_B$.
if $|x_A - x_B|^2 \ge |R_A^2 - R_B^2|$ then
    $c = \frac{1}{2}(x_A + x_B) - \frac{R_A^2 - R_B^2}{2|x_A - x_B|^2}(x_A - x_B)$.
    $R^2 = R_B^2 - \frac{\left(|x_A - x_B|^2 + R_B^2 - R_A^2\right)^2}{4|x_A - x_B|^2}$.
else if $|x_A - x_B|^2 < R_A^2 - R_B^2$ then
    $c = x_B$. $R = R_B$.
else
    $c = x_A$. $R = R_A$. (^a)
end
Output: a ball centered at $c$ with radius $R$.

(^a) If we assume $|x_A - x_B| \ge R_B$ as in Lemma 1, this extra case does not exist.
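Algorithm 1 translates almost line by line into code. The sketch below is one possible transcription, assuming squared radii as inputs and adding a guard for coincident centers that the algorithm itself does not spell out; the function name and the numerical example for Lemma 1 (with $\delta = 0$) are illustrative.

```python
def min_enclosing_ball(xa, ra_sq, xb, rb_sq):
    """Center and squared radius following Algorithm 1.

    Radii are passed squared (ra_sq = R_A^2, rb_sq = R_B^2). The two balls are
    assumed to intersect; otherwise the returned squared radius can be negative.
    """
    diff = [a - b for a, b in zip(xa, xb)]
    dist_sq = sum(v * v for v in diff)
    # The dist_sq > 0 test is an added guard against division by zero.
    if dist_sq > 0.0 and dist_sq >= abs(ra_sq - rb_sq):
        shift = (ra_sq - rb_sq) / (2.0 * dist_sq)
        center = [0.5 * (a + b) - shift * v for a, b, v in zip(xa, xb, diff)]
        radius_sq = rb_sq - (dist_sq + rb_sq - ra_sq) ** 2 / (4.0 * dist_sq)
        return center, radius_sq
    if dist_sq < ra_sq - rb_sq:
        return list(xb), rb_sq      # A covers at least half of B: keep B
    return list(xa), ra_sq          # B covers at least half of A: keep A


# Numerical instance of Lemma 1 with delta = 0 (values chosen for illustration).
eps, g = 0.25, 0.9
a = [1.0, 0.0]                      # |a| >= g, as the lemma requires
center, radius_sq = min_enclosing_ball([0.0, 0.0], 1.0 - eps * g * g,
                                       a, g * g * (1.0 - eps))
print(radius_sq <= 1.0 - eps ** 0.5)
```

Working with squared radii throughout avoids square roots and matches the convention $\mathrm{B}(x, r^2)$ used in the text.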


Algorithm 2 is a more aggressive variant of the method analyzed in Theorem 1; for instance, it uses line searches instead of a fixed step size. The correctness of this version follows from a proof similar to that of Theorem 1.

This algorithm requires neither the smoothness parameter nor the number of iterations, and it guarantees that the function value is strictly decreasing. These are useful properties for machine learning applications because the only required parameter, $\alpha$, is usually given. Furthermore, we believe that the integration of zeroth and first order information about the function makes our new method particularly well-suited in practice.


4 Experiments 

In this section, we compare the Geometric Descent method (GeoD) with a variety of full gradient methods. These include steepest descent (SD), accelerated full gradient method (AFG), accelerated full gradient method with adaptive restart (AFGwR) and quasi-Newton with limited-memory BFGS updating (L-BFGS). For SD, we compute the gradient and perform an exact line search along the gradient direction. For AFG, we use the 'Constant Step Scheme II' in [Nesterov(2004)]. For AFGwR [ODonoghue and Candes(2013)], we use the function restart scheme and replace the gradient step by an exact line search to improve its performance. For both AFG and AFGwR, the parameter is chosen among all powers of 2 for each dataset individually. For L-BFGS, we use the software developed by Mark Schmidt with default settings (see [Schmidt(2012)]).

In all experiments, the minimization problem is of the form $\sum_i \varphi(a_i^\top x)$ where computing $a_i^\top x$ is the computational bottleneck. Therefore, if we reuse the calculations carefully, each iteration of all mentioned methods requires only one calculation of $a_i^\top x$ for some $x$. In particular, the cost of exact line searches is negligible compared with the cost of computing $a_i^\top x$. Hence, we simply report the number of iterations in the following experiments.

Algorithm 2: Geometric Descent Method (GeoD)
Input: parameter $\alpha$ and initial point $x_0$.
$x_0^+ = \mathrm{line\_search}(x_0, x_0 - \nabla f(x_0))$.
$c_0 = x_0 - \alpha^{-1}\nabla f(x_0)$.
$R_0^2 = \frac{|\nabla f(x_0)|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x_0) - f(x_0^+)\right)$.
for $k \leftarrow 1, 2, \ldots$ do
    Combining Step:
        $x_k = \mathrm{line\_search}(x_{k-1}^+, c_{k-1})$.
    Gradient Step:
        $x_k^+ = \mathrm{line\_search}(x_k, x_k - \nabla f(x_k))$.
    Ellipsoid Step:
        $x_A = x_k - \alpha^{-1}\nabla f(x_k)$. $R_A^2 = \frac{|\nabla f(x_k)|^2}{\alpha^2} - \frac{2}{\alpha}\left(f(x_k) - f(x_k^+)\right)$.
        $x_B = c_{k-1}$. $R_B^2 = R_{k-1}^2 - \frac{2}{\alpha}\left(f(x_{k-1}^+) - f(x_k^+)\right)$.
        Let $\mathrm{B}(c_k, R_k^2)$ be the minimum enclosing ball of $\mathrm{B}(x_A, R_A^2) \cap \mathrm{B}(x_B, R_B^2)$.
end
Output: $x_T$.
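To show how the steps of Algorithm 2 fit together, here is a minimal sketch in plain Python. It assumes a diagonal quadratic objective so that the exact line search has a closed form; the helper names (`geometric_descent`, `make_quadratic`, `min_enclosing_ball`) and the test problem are illustrative assumptions, not the implementation used for the experiments.

```python
def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_enclosing_ball(xa, ra_sq, xb, rb_sq):
    """Algorithm 1 with squared radii; guards the degenerate case xa == xb."""
    diff = sub(xa, xb)
    dist_sq = dot(diff, diff)
    if dist_sq > 0.0 and dist_sq >= abs(ra_sq - rb_sq):
        shift = (ra_sq - rb_sq) / (2.0 * dist_sq)
        center = [0.5 * (a + b) - shift * v for a, b, v in zip(xa, xb, diff)]
        return center, rb_sq - (dist_sq + rb_sq - ra_sq) ** 2 / (4.0 * dist_sq)
    if dist_sq < ra_sq - rb_sq:
        return list(xb), rb_sq
    return list(xa), ra_sq


def geometric_descent(f, grad, line_search, alpha, x0, num_iters):
    """Run num_iters iterations of Algorithm 2; returns (x_k, c_k, R_k^2)."""
    x = list(x0)
    g = grad(x)
    x_plus = line_search(x, sub(x, g))
    c = [xi - gi / alpha for xi, gi in zip(x, g)]
    r_sq = dot(g, g) / alpha ** 2 - 2.0 / alpha * (f(x) - f(x_plus))
    for _ in range(num_iters):
        x = line_search(x_plus, c)                       # combining step
        g = grad(x)
        new_x_plus = line_search(x, sub(x, g))           # gradient step
        xa = [xi - gi / alpha for xi, gi in zip(x, g)]   # ellipsoid step
        ra_sq = dot(g, g) / alpha ** 2 - 2.0 / alpha * (f(x) - f(new_x_plus))
        rb_sq = r_sq - 2.0 / alpha * (f(x_plus) - f(new_x_plus))
        c, r_sq = min_enclosing_ball(xa, ra_sq, c, rb_sq)
        x_plus = new_x_plus
    return x, c, r_sq


def make_quadratic(d):
    """f(x) = 0.5 * sum(d_i x_i^2) with its gradient and exact line search."""
    def f(x):
        return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

    def grad(x):
        return [di * xi for di, xi in zip(d, x)]

    def line_search(x, y):
        v = sub(y, x)
        curvature = sum(di * vi * vi for di, vi in zip(d, v))
        if curvature == 0.0:        # degenerate line: stay where we are
            return list(x)
        t = -sum(di * xi * vi for di, xi, vi in zip(d, x, v)) / curvature
        return [xi + t * vi for xi, vi in zip(x, v)]

    return f, grad, line_search


d = [1.0, 4.0, 25.0, 100.0]         # alpha = 1, beta = 100, minimizer x* = 0
f, grad, line_search = make_quadratic(d)
x0 = [1.0, -2.0, 0.5, 3.0]
_, _, r0_sq = geometric_descent(f, grad, line_search, min(d), x0, 0)
x_T, c_T, rT_sq = geometric_descent(f, grad, line_search, min(d), x0, 20)
print(f(x0), f(x_T), r0_sq, rT_sq)
```

Note that only $\alpha$ is passed in; the smoothness parameter $\beta$ never appears in the code, in line with the remark that the method does not require it. For a general objective the closed-form `line_search` would be replaced by a one-dimensional minimizer.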

4.1 Binary Classification 

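We evaluate the algorithms via the binary classification problem on the 40 datasets¹ from LIBSVM data [Chang and Lin(2011)]. The problem is to minimize the regularized empirical risk:

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} \varphi(b_i a_i^\top x) + \frac{\lambda}{2}|x|^2$$

where $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$ are given by the datasets, $\lambda$ is the regularization coefficient and $\varphi$ is the smoothed hinge loss function given by

$$\varphi(z) = \begin{cases} 0 & \text{if } z \ge 1 \\ \frac{1}{2} - z & \text{if } z \le 0 \\ \frac{1}{2}(1 - z)^2 & \text{otherwise.} \end{cases}$$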
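The smoothed hinge loss and the regularized empirical risk are simple to evaluate. The following sketch assumes dense Python lists for the data; the function names and the three-sample dataset are hypothetical and serve only to illustrate the formulas.

```python
def smoothed_hinge(z):
    """Smoothed hinge loss phi(z)."""
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return 0.5 - z
    return 0.5 * (1.0 - z) ** 2


def smoothed_hinge_grad(z):
    """Derivative phi'(z); continuous, with values in [-1, 0]."""
    if z >= 1.0:
        return 0.0
    if z <= 0.0:
        return -1.0
    return z - 1.0


def regularized_risk(x, features, labels, lam):
    """Value and gradient of (1/n) * sum_i phi(b_i * a_i^T x) + (lam/2) * |x|^2."""
    n = len(features)
    value = 0.5 * lam * sum(xi * xi for xi in x)
    gradient = [lam * xi for xi in x]
    for a, b in zip(features, labels):
        margin = b * sum(ai * xi for ai, xi in zip(a, x))   # the costly a_i^T x
        value += smoothed_hinge(margin) / n
        slope = smoothed_hinge_grad(margin) * b / n
        gradient = [gj + slope * aj for gj, aj in zip(gradient, a)]
    return value, gradient


# Tiny made-up dataset (hypothetical values, for illustration only).
features = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]
labels = [1.0, -1.0, 1.0]
value, gradient = regularized_risk([0.1, -0.2], features, labels, lam=1e-2)
print(value, gradient)
```

Because $\varphi$ is convex, the regularizer makes this objective $\lambda$-strongly convex, so $\lambda$ can serve as the parameter $\alpha$ required by GeoD.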

We solve this problem with five different regularization coefficients $\lambda$, each a negative power of 10, and report the median and 90th percentile of the number of steps required to achieve a certain accuracy. In Figure 3, we see that GeoD is better than SD, AFG and AFGwR, but worse than L-BFGS. Since L-BFGS stores and uses the gradients of the previous iterations, it would be interesting to see whether GeoD becomes competitive with L-BFGS if it computes the intersection of multiple balls instead of 2 balls.

¹ We omitted all datasets of size >100 MB for time considerations.
Figure 3: Comparison of full gradient methods on 40 datasets and 5 regularization coefficients with smoothed hinge loss function. The left diagram (panel title: "Median of # Iteration for 200 Problems") shows the median of the number of iterations required to achieve a certain accuracy and the right diagram shows the 90th percentile.


4.2 Worst Case Experiment 

In this section, we consider the minimization problem

$$f(x) = \frac{\beta}{2}\left((1 - x_1)^2 + \sum_{i=1}^{n-1}(x_i - x_{i+1})^2 + x_n^2\right) + \frac{1}{2}\sum_{i=1}^{n} x_i^2 \tag{2}$$

where $\beta$ is the smoothness parameter. Within the first $n$ iterations, it is known that any iterative method that uses only gradient information cannot minimize this function faster than the rate $1 - O(\beta^{-1/2})$.
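For experimentation, function (2) and its gradient can be written down directly. The sketch below assumes plain Python lists; the function names are illustrative. The last line shows the gradient at the origin, where only the first coordinate is nonzero, which matches the usual intuition for why gradient-based methods need about $n$ iterations to "see" all coordinates of this chain-structured function.

```python
def worst_case_f(x, beta):
    """Function (2): a chain-coupled quadratic plus 0.5 * |x|^2."""
    n = len(x)
    chain = (1.0 - x[0]) ** 2
    chain += sum((x[i] - x[i + 1]) ** 2 for i in range(n - 1))
    chain += x[-1] ** 2
    return 0.5 * beta * chain + 0.5 * sum(xi * xi for xi in x)


def worst_case_grad(x, beta):
    """Gradient of function (2)."""
    n = len(x)
    g = [xi for xi in x]                 # gradient of 0.5 * |x|^2
    g[0] += beta * (x[0] - 1.0)          # from (1 - x_1)^2
    for i in range(n - 1):               # from (x_i - x_{i+1})^2
        coupling = beta * (x[i] - x[i + 1])
        g[i] += coupling
        g[i + 1] -= coupling
    g[-1] += beta * x[-1]                # from x_n^2
    return g


print(worst_case_grad([0.0, 0.0, 0.0, 0.0], 10.0))
```

Each gradient evaluation couples only neighboring coordinates, so starting from the origin the iterates of a method built from gradients gain at most one new nonzero coordinate per step.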

In Figure 4, we see that every method except SD converges at the same rate, with different constants, for the first $n$ iterations. However, after $O(n)$ iterations, both SD and AFG continue to converge at the rate predicted by the theory while the other methods converge much faster. We remark that the memory size of L-BFGS we are using is 100 and if in the right example we choose $n = 100$ instead of 200, L-BFGS will converge at $n = 100$ immediately. It is surprising that AFGwR and GeoD can achieve a comparable result using a "memory size" of 1.
Figure 4: Comparison of full gradient methods for the function (2)